DOCUMENT RESUME

ED 224 831                                        TM 830 025

AUTHOR          Choppin, Bruce
TITLE           Latent Trait Models for Answer-Until-Correct Tests.
                Methodology Project.
INSTITUTION     California Univ., Los Angeles. Center for the Study
                of Evaluation.
SPONS AGENCY    National Inst. of Education (ED), Washington, DC.
PUB DATE        Nov 82
GRANT           NIE-G-80-0112
NOTE            55p.
PUB TYPE        Reports - Research/Technical (143)
EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     Academic Achievement; *Computer Assisted Testing;
                Computer Programs; *Educational Testing; *Guessing
                (Tests); *Latent Trait Theory; Measurement
                Techniques; *Multiple Choice Tests; Research
                Methodology; Test Items
IDENTIFIERS     *Answer Until Correct; Rasch Model

ABSTRACT
The answer-until-correct procedure has made
comparatively little impact on the field of educational testing due
to the absence of a sound theoretical base for turning the response
data into measures. Three new latent trait models are described. They
differ in their complexity, though each is designed to yield a single
parameter to measure student achievement. The simplest, a "partial
credit" model, has a single difficulty parameter for each item. This
model takes no account of the variations in distractor attractiveness
from item to item, nor of which distractors were actually selected by
the respondent. The second model treats the test as a sequence of
distinct steps, each of which has a difficulty parameter. This method
does not assume that all items have the same logical structure with
regard to difficulty. It takes no account of which distractors are
selected. The third model is an extension of the second. In this
model, the step difficulty values for an item vary in terms of which
distractors were previously selected. A technical manual describing
software developed for an effective and efficient program for
administering answer-until-correct tests using microcomputer systems
is reported as Appendix 1. (Author/PN)



*********************************************************************** 

* Reproductions supplied by EDRS are the best that can be made * 

* from the original document. * 
*********************************************************************** 



Deliverable - November 1982 
METHODOLOGY PROJECT 



LATENT TRAIT MODELS FOR
ANSWER-UNTIL-CORRECT TESTS 



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.



Bruce Choppin
Study Director 






Grant Number 
NIE-G-80-0112, P3 






CENTER FOR THE STUDY OF EVALUATION 

Graduate School of Education 
University of California, Los Angeles 



The project presented or reported herein was performed pursuant 
to a grant from the National Institute of Education, Department 
of Education. However, the opinions expressed herein do not 
necessarily reflect the position or policy of the National 
Institute of Education, and no official endorsement by the 
National Institute of Education should be inferred. 






LATENT TRAIT MODELS FOR ANSWER-UNTIL-CORRECT TESTS 



1. Introduction 

Though they are convenient to use and have some desirable
psychometric properties, multiple choice tests have been very widely
attacked. Three specific criticisms that have been made against
conventional multiple choice tests are:

1) That they face the testee with three or four times as many
incorrect statements as correct ones and provide no feedback
to help the student learn the correct answers.

2) That they encourage random guessing.

3) That they are inefficient and that little information is
gained about the student from his response to a single item.

The "answer-until-correct" (AUC) testing mode (Brown, 1965; Hanna,
1975) is designed to overcome these problems. In this mode the
student is given instant feedback on each response. If the
response is correct, the student is directed to continue to the next
question, but if the response is incorrect he or she is asked to
attempt the item again. This form of testing has the advantage of
extracting significantly more information about a student's ability
from a given number of items, and thus makes it easier to distinguish
between different levels of partial knowledge or part mastery. It has
also been suggested that this response mode reduces the incidence of
random guessing behavior among students, and has the additional
benefit that (most of the time) the final answer chosen by the student
to an item is also the correct one. There is, a priori, reason to
believe that this response, the one that receives positive
reinforcement, is the one most likely to be remembered.

A number of research studies have focused on the characteristics 
and usefulness of answer-until-correct testing. For example, Merwin
(1959), Brown (1965) and Frary (1980) investigated various scoring 
procedures. None of the more complex alternatives they tried appeared 
to improve significantly on Brown's simple approach of reducing the 
total score by one point for every incorrect distractor selected. 
Hanna (1975), and Kane & Moloney (1978), investigated the implications 
of AUC responding for reliability and validity. Hanna suggested that 
the AUC procedure increased reliability but generally appeared to 
decrease validity as measured by correlation with a substantive 
external criterion. The implication is that testwiseness may play a 
more significant role in AUC tests than on conventional tests. This 
relates back to Merwin's earlier paper in which he concluded that if 
test constructors were to reap advantages from the AUC procedure, then 
item distractors would have to be carefully designed so as to relate 
in a clear way to the criterion variable. 

Much of the earlier work displayed considerable vagueness as to 
the presumed behavior of the student when taking a test. 

A careful reading and analysis of the logic presented suggests
that the writers were assuming the relevance of one or the other of
two contrasting and incompatible models. The first, which may be
called the partial knowledge model, assumes that the student may know
enough about the subject matter with which the item is concerned
to be able to eliminate one or more of the distractors with some
certainty. He is then presumed to guess at random among those that
remain. Complete mastery of the problem involves the certain
elimination of all but one of the alternative responses so that the
student chooses the correct answer without guessing.

The second model assumes that a student arrives at an incorrect
response not through some guessing procedure, but through the
application of misinformation. Under the answer-until-correct
procedure, such a student, having applied his misinformation to obtain
the wrong answer, is forced to choose again. The feedback that the
first piece of misinformation is incorrect may be important incidental
learning. The next choice may be a random guess, or another response
selected on the basis of misinformation.

Frary showed that the AUC procedure was effective in
discriminating between students when they operated on the basis of
partial information, but suggested that the scoring procedure could be
improved for students operating under the misinformation model. Wilcox
(1982) further considers the distinction between the partial knowledge
and misinformation models and appropriate rules for scoring tests when
the latter operates. Unfortunately, it would appear that in practice
many individuals use both strategies when taking tests, and it is
difficult to tell from the pattern of results on which
items they were employing partial knowledge and on which
misinformation. Questioning students following the administration of
an AUC test could help to clarify this issue.

The answer-until-correct procedure has made comparatively little
impact on the field of educational testing in the seventeen years 
since Brown's paper for two reasons: 

(a) the lack of convenient and appropriate technology for 
providing instant feedback to the student, since clinical 
administration of tests is prohibitively expensive; and 

(b) the absence of a sound theoretical base for turning the data 
into measures, for while Brown's system appears to work in 
practice, there is no model to substantiate it or check its 
validity. 

On the first issue, there have been a number of recent
developments. Answer-until-correct tests currently in use (on an
experimental or regular basis) use one of three different feedback
technologies. The first approach requires an answer sheet preprinted
in invisible ink, so that when the student responds (using a special
pen) a portion of the preprinted material becomes visible, and the
student obtains the appropriate feedback. The second method involves
having the student erase a shield printed over the top of the feedback
information, again on a specially prepared answer sheet. Each of these
approaches requires some special equipment for preparing the answer
sheets, which have to be customized to fit a particular test. However,
this equipment is now fairly generally available, and the answer
sheets produced from it are not unduly expensive.

The third approach involves testing by the computer. This method
is potentially superior to the other methods because it allows the
recording of the sequence in which particular responses are chosen.
The first two methods described allow only the inference that the
correct response was chosen last, but do not easily allow the earlier
incorrect responses to be ordered. Until very recently the computer
was far too expensive to be considered seriously as a test
administering device, but the rapid development of terminals and in
particular of inexpensive microprocessors opens up new possibilities.

The computer is able not only to record the sequence in which
distractors are selected, but also to accumulate other information
(e.g., how long the delay was between each response), and continually
update estimates of the student's level of performance and the
measurement precision. It is also able to provide more or less
detailed feedback under the control of the test constructor, and to
provide the feedback in an entirely standard fashion so that no
inadvertent clues are presented. During the last year, the CSE team
has devoted considerable effort to developing an effective and
efficient program for administering answer-until-correct tests using
Apple microcomputer systems. We have designed this system so as to be
useful to teachers who currently have access to Apple or similar
computers. The system has also been valuable in collecting
answer-until-correct data for use in our psychometric research, and it
records on disk, in a standard format, considerable information about
each student's attempts at the test, including his or her expressed
confidence in each of the initial responses to each item.

The technical manual describing the software we have developed to
accomplish this is attached to the present report as Appendix 1. A
somewhat simplified description designed to be used as a teacher's
manual is currently in preparation.

The rest of this paper will be devoted to describing the latent
trait models which address the second of the problems mentioned
earlier, the absence of a sound theoretical base for turning the
response data into a measure.

2. Latent Trait Models

Three new latent trait models will be described in the remainder 
of this paper. They differ from one another in their complexity, 
though each is designed to yield a single parameter to measure student 
achievement. 

The simplest, a "partial credit" model, has a single difficulty
parameter for each item. It is the latent trait analogue for Brown's 
(1965) integer scoring scheme based on the number of attempts needed 
to reach the correct response. The scoring is from 1.0 for a correct 
response on the first attempt to 0.0 for failure in (m-1) attempts, 
where there are m alternatives presented for an item. This model 
takes no account of the variations in distractor attractiveness from 
item to item, nor of which distractors were actually selected by the 
respondent. 

The second latent trait model treats the test as a sequence of 
distinct steps each of which has a difficulty parameter. A single 
five-way multiple choice item can be regarded as comprising four 
steps, with each successive step after the first being attempted if, 
and only if, the preceding one is failed. The scoring is 1/0 for each 
step, with steps not attempted being coded as incomplete data. This
produces four difficulty parameters for each item, but a single and
more precise ability estimate for the individual. The method does not 
assume that all the items have the same logical structure with regard 
to difficulty, but it takes no account of exactly which distractors 
are selected. 

The third model is an extension of the second. In this model,
the step difficulty values for an item vary in terms of which
distractors were previously selected. Thus for a five-way multiple
choice item there is one difficulty parameter at the first step, four
at the second, six at the third, and four at the fourth. This gives a
total of fifteen difficulty parameters for a single five-way multiple
choice item. It should in general give a better fit than the model
described above because it treats the distractors individually, but it
requires more data for the necessary calibration of the item
parameters.

To some extent, the utility of these models is going to depend on
the relative preponderance of the two styles of student behavior
discussed earlier. Under partial knowledge, distractor elimination
and random guessing (style A), the noise introduced by guessing
precludes the possibility of very precise measurement, and the first
model described may well prove as effective as either of the others.
Where item responses based on correct information or misinformation
(style B) dominate, we would expect that models two and three would
provide more precise and valid measures of student performance.

Each of the models described is based on the simple one-parameter
Rasch logistic model. This is for two reasons. Firstly, as argued in
a separate report to NIE, the Rasch model seems the logical choice in
a situation which involves the construction of new test instruments,
since it focuses attention on meeting the logical requirements for
objective measurement. Secondly, the main alternative, the
three-parameter logistic model, has severe practical limitations even when
applied to regular test data. Estimating techniques are primitive,
and very large samples are required in order to obtain stable
parameter estimates. The three-parameter model has been found useful
in describing large bodies of existing data derived from tests of
varied quality, but such data sets do not exist in the AUC format.
Since obtaining sufficient data for adequate item calibration is
anticipated to be a problem even for the Rasch model, it appeared
sensible to concentrate initial efforts in this direction.

Model (i): Fixed Partial Credit

The model is

$$E(X_{vi}) = \frac{e^{\alpha_v - \delta_i}}{1 + e^{\alpha_v - \delta_i}}$$

where: $E(X_{vi})$ is the expected score of person v on item i
       $\alpha_v$ is a parameter describing the ability of person v
       $\delta_i$ is a parameter describing the difficulty of item i

and the scoring function is

$$X_{vi} = \frac{m_i - g_{vi}}{m_i - 1}$$

where $m_i$ is the number of alternative choices on item i (of which 1 is
correct and $(m_i - 1)$ are incorrect)
and $g_{vi}$ is the number of attempts by person v on item i until the
correct alternative is chosen. If the $(m_i - 1)$th attempt
fails then $X_{vi} = 0$.
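
The scoring function and the logistic expectation above can be illustrated with a minimal Python sketch. The function names `partial_credit_score` and `expected_score` are hypothetical, chosen for this illustration only, and the convention that g = m stands for a student who has exhausted every distractor is an assumption drawn from the statement that the score is then zero.

```python
import math


def partial_credit_score(m: int, g: int) -> float:
    """Fixed partial credit X = (m - g) / (m - 1).

    m is the number of alternatives on the item and g the attempt on which
    the correct alternative was reached. g == m (assumed convention) means
    every distractor was tried first, which scores zero.
    """
    if m < 2 or not 1 <= g <= m:
        raise ValueError("need m >= 2 and 1 <= g <= m")
    return (m - g) / (m - 1)


def expected_score(alpha: float, delta: float) -> float:
    """Rasch logistic expectation exp(alpha - delta) / (1 + exp(alpha - delta))."""
    return 1.0 / (1.0 + math.exp(-(alpha - delta)))


# Scores for a five-way item solved on attempts 1 through 5.
five_way_scores = [partial_credit_score(5, g) for g in range(1, 6)]
print(five_way_scores)  # [1.0, 0.75, 0.5, 0.25, 0.0]

# A person whose ability equals the item difficulty expects half credit.
print(expected_score(0.0, 0.0))  # 0.5
```

The equal spacing of the five scores is the "equal-interval" property that the text goes on to justify.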



The rationale for this scoring scheme is based on a "partial
knowledge" distractor elimination model. If a correct response is
chosen at the first attempt, then it is assumed that the student was
able to eliminate all the distractors, and so he or she gets full
credit. If the first attempt fails, but the second attempt is
correct, it is assumed that he or she could eliminate all the
distractors but one, so that credit of $\frac{m_i - 2}{m_i - 1}$ is awarded. (The number
of distractors is $(m_i - 1)$.)

Although this equal-interval scoring function may appear somewhat
arbitrary, it is analogous to that frequently adopted in elementary
scaling techniques (e.g., Likert scales). Moreover, Andersen (1977)
has shown that for the model to retain specific objectivity,
successive scoring categories must be equidistant. The immediate
advantage of this is that the "raw score" obtained by a student who has worked
through the set of items is a sufficient statistic for the ability
(and frequently may be used instead of it, hence the viability of the
scheme proposed by Brown).

Parameter estimation is approached via a modification of the
Rasch PAIR estimation algorithm (Choppin, 1982). For two items i and
j, the relative difficulty can be estimated by

$$\hat{\delta}_i - \hat{\delta}_j = \ln\left(\frac{b_{ji}}{b_{ij}}\right)$$

where, on this occasion, $b_{ij}$ is the sum, over all people in the sample,
of $X_i(1 - X_j)$ and $b_{ji}$ is similarly defined. (It can be seen that this
reduces to the standard PAIR algorithm in the case of 1/0 scoring.)
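
The pairwise accumulation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FORTRAN program mentioned later in the report: the names `pair_counts` and `relative_difficulty` are hypothetical, and the log-ratio form ln(b_ji / b_ij) for the difference in difficulty between items i and j is assumed from the usual pairwise Rasch argument.

```python
import math


def pair_counts(scores):
    """Accumulate b[i][j] = sum over persons of X_vi * (1 - X_vj).

    scores[v][i] is the (possibly fractional) score of person v on item i,
    a number between 0 and 1.
    """
    k = len(scores[0])
    b = [[0.0] * k for _ in range(k)]
    for row in scores:
        for i in range(k):
            for j in range(k):
                if i != j:
                    b[i][j] += row[i] * (1.0 - row[j])
    return b


def relative_difficulty(b, i, j):
    """Assumed estimate of (difficulty of i) - (difficulty of j): ln(b_ji / b_ij).

    Undefined when either count is zero; a real program would need to
    handle that case.
    """
    return math.log(b[j][i] / b[i][j])


# With 1/0 scores, b[i][j] is simply the number of people who got item i
# right and item j wrong, as in the standard PAIR algorithm.
binary = [[1, 0], [1, 0], [0, 1]]
b = pair_counts(binary)
print(b[0][1], b[1][0])

# Fractional scores from a five-way AUC item work the same way.
fractional = [[1.0, 0.5], [0.75, 0.25], [0.5, 0.0], [1.0, 1.0]]
bf = pair_counts(fractional)
print(relative_difficulty(bf, 0, 1) < 0)  # item 0 appears easier than item 1
```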

$X_i(1 - X_j)$ represents the product of an estimate of the extent to
which item i is mastered multiplied by an estimate of the extent to
which item j is not mastered. It may be viewed, for each subject, as a
measure of the extent to which item i is easier than item j. The
ratio

$$\frac{E[X_i(1 - X_j)]}{E[X_j(1 - X_i)]} = e^{\delta_j - \delta_i}$$

is a value independent of $\alpha$, which is why the accumulation of data
over persons to estimate these expectations works.

The algebra for maximum likelihood estimation, and for
controlling the model via the squared matrix B*, exactly duplicates
that laid out in Choppin (1982), except that the formulae presented
there for the standard errors of the $\delta$-values are no longer
appropriate. (Corrected formulae have not yet been developed, so the
values reported by PAIR are used as conservative guides.) Once the
items are calibrated, the estimation of person ability again follows
the PAIR procedure.

Model (ii): Step Calibration

In this model, the probability of person v responding correctly
to item i at the gth attempt, given that he or she makes the attempt,
is:

$$\mathrm{Prob}\{X_{vig} = 1\} = \frac{e^{\alpha_v - \delta_{ig}}}{1 + e^{\alpha_v - \delta_{ig}}}$$

where $X_{vig}$ = 1 if the gth attempt at item i is successful, and
              = 0 otherwise
$\alpha_v$ is again a parameter describing the ability of person v
and $\delta_{ig}$ is a parameter describing the difficulty of the gth step on
item i.

For a five-way multiple choice item there are five possible sets
of observation vectors X, with asterisks indicating missing data
(i.e., attempts that do not occur).

                              g =   1  2  3  4

Correct at first attempt:     X =   1  *  *  *
Correct at second attempt:    X =   0  1  *  *
Correct at third attempt:     X =   0  0  1  *
Correct at fourth attempt:    X =   0  0  0  1
Failure at fourth attempt:    X =   0  0  0  0


If the raw data to be analyzed consists of code numbers for the
successful attempt on each item, then it must be transformed into the
above format for the calibration analysis. For example, suppose that
an individual required (2, 1, 1, 4, 5, 3) attempts to find the correct
answers to a six item five-way multiple choice test. The recoding of
this vector would yield:

0 1 * *   1 * * *   1 * * *   0 0 0 1   0 0 0 0   0 0 1 *

a vector of 24 elements. A set of such vectors from the different
persons attempting the test can be analyzed almost as a standard Rasch
model problem, provided the PAIR algorithm (Choppin, 1982) is used to
allow for the embedded missing data. The deviation from the standard
Rasch procedure is necessitated by the violation of the local
independence assumption for AUC data. While it remains important that
between items this independence is maintained, it is clear that within
an item the different X-values cannot be independent. As shown above,
only m possible patterns out of the $3^{m-1}$ theoretically possible on each
item (three codes, 0, 1 or missing, at each of the m-1 steps) ever occur
and certain combinations such as (1,0) are impossible.
This invalidates the maximum likelihood estimation procedure which
assumes that the elements of the B matrix for item pairs are
essentially independent.
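
The recoding of attempt counts into step vectors described above can be sketched as follows. The helper names `recode_item` and `recode_test` and the use of the string "*" for missing data are choices made for this illustration; an attempt count equal to m is taken to mean failure at the last scored step, as in the worked example.

```python
MISSING = "*"


def recode_item(g: int, m: int) -> list:
    """Step vector of length m - 1 for an item solved on attempt g.

    Steps before g are failures (0), step g is the success (1), and later
    steps are never attempted. g == m gives a vector of all zeros.
    """
    if not 1 <= g <= m:
        raise ValueError("attempt count must lie between 1 and m")
    steps = []
    for step in range(1, m):
        if step < g:
            steps.append(0)
        elif step == g:
            steps.append(1)
        else:
            steps.append(MISSING)
    return steps


def recode_test(attempts, m: int) -> list:
    """Concatenate the step vectors of all items into one response vector."""
    vector = []
    for g in attempts:
        vector.extend(recode_item(g, m))
    return vector


vector = recode_test([2, 1, 1, 4, 5, 3], m=5)
print(len(vector))  # 24
print(" ".join(str(x) for x in vector))
```

Only m distinct vectors can come out of `recode_item` for a given m, which is the restriction on response patterns that breaks local independence within an item.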

The full theoretical implications of this are still being
explored, but a convenient "fix" in order to calibrate the items is to
use instead of ML a least squares procedure based on a modified B*
matrix. This B*, instead of being simply the square of matrix B as
before, is now screened to remove the contaminating dependence within
items.

In the standard PAIR algorithm

$$b^*_{ij} = \sum_k b_{ik} b_{kj}$$

and since $b_{ii} = b_{jj} = 0$, $b^*_{ij}$ is independent of $b_{ij}$.
In PAIR as modified for AUC tests

$$b^*_{ij} = \sum_k v_{ik} b_{ik} v_{kj} b_{kj}$$

where $v_{ik}$ are the elements of a screening matrix such that
$v_{pq} = 0$ if responses p and q relate to the same item
and $v_{pq} = 1$ otherwise.
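
A small Python sketch may make the screening concrete. It assumes the sum runs over every step k and that each step carries a label naming its parent item; the names `screening_matrix` and `screened_b_star` and the toy B matrix are hypothetical.

```python
def screening_matrix(item_of_step):
    """v[p][q] = 0 when steps p and q belong to the same item, else 1."""
    n = len(item_of_step)
    return [[0 if item_of_step[p] == item_of_step[q] else 1 for q in range(n)]
            for p in range(n)]


def screened_b_star(b, v):
    """Screened square of B: entry (i, j) is sum over k of v_ik b_ik v_kj b_kj."""
    n = len(b)
    return [[sum(v[i][k] * b[i][k] * v[k][j] * b[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]


# Toy example: two items with two steps each (steps 0, 1 belong to item 0;
# steps 2, 3 belong to item 1). The counts in b are invented.
item_of_step = [0, 0, 1, 1]
b = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [6, 7, 0, 1],
     [8, 9, 1, 0]]
v = screening_matrix(item_of_step)
b_star = screened_b_star(b, v)
for row in b_star:
    print(row)
```

Any product that chains two steps of the same item is zeroed out, so only comparisons routed through a different item contribute to an entry of the screened matrix.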



A least squares estimation procedure applied to the B* matrix yields
calibrations for the $\delta_{ig}$ values ($i = 1, \ldots, k$; $g = 1, \ldots, m-1$).

The estimation of person ability, the usual goal in such
exercises, is somewhat different than in the standard Rasch model.
Apart from rare failures at the final attempt, each student will score
one point on each item and thus will have a raw score of k.

However, this raw score will be based on different numbers of
"attempts", and individual step difficulties will be higher on some
items than on others. Therefore $\alpha_v$ is estimated by the solution of

$$r_v = \sum \frac{e^{\alpha_v - \delta_{ig}}}{1 + e^{\alpha_v - \delta_{ig}}}$$

where the summation extends over the item steps actually attempted, and $r_v$ is the
observed raw score (usually k). This equation can always be solved to
produce a unique LS estimation of $\alpha_v$, but may be inefficient since
its (iterative) solution is required for each observed score pattern.
Monte Carlo simulation could compare the variation in $\alpha_v$ with the
scoring function proposed by Brown (1965), to see whether the exact
iterative solution is worthwhile.
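
One way to carry out the iterative solution is sketched below. It assumes the estimating equation sets the observed raw score equal to the sum of the logistic success probabilities over the steps actually attempted, and it uses plain bisection rather than whatever iteration the original programs used. The function names, the search bounds of plus or minus 10 logits, and the step difficulties in the example are all hypothetical. The sketch also assumes a raw score strictly between zero and the number of steps attempted, since no finite root exists otherwise.

```python
import math


def expected_raw_score(alpha, step_difficulties):
    """Sum of Rasch success probabilities over the steps actually attempted."""
    return sum(1.0 / (1.0 + math.exp(-(alpha - d))) for d in step_difficulties)


def estimate_ability(raw_score, step_difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Solve expected_raw_score(alpha) = raw_score by bisection.

    The expected score increases strictly with alpha, so the root is unique
    when it exists. The bounds lo and hi are an assumed search range.
    """
    n = len(step_difficulties)
    if not 0 < raw_score < n:
        raise ValueError("no finite estimate for a zero or perfect raw score")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, step_difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# A student needing (2, 1, 1, 4, 5, 3) attempts on six five-way items
# attempts 2 + 1 + 1 + 4 + 4 + 3 = 15 steps and succeeds on 5 of them.
# All step difficulties are set to 0.0 purely for illustration.
difficulties = [0.0] * 15
alpha_hat = estimate_ability(5, difficulties)
print(round(alpha_hat, 4))
```

With equal step difficulties the answer reduces to the log odds of success per step, which gives a quick check on the routine.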

The standard errors of such estimates depend upon the number of
attempts made. Thus someone who usually responds correctly at the
first attempt will be measured with less precision than someone who
typically requires two or three attempts. Data in which the mean
number of attempts per item is 2.0 (a typical value) will yield
standard errors of measurement only 0.7 times as large as with a
conventional test with the same number of items. From this it can be
seen that major increases in precision can only be achieved by
substantially increasing the number of alternatives per question, so
that the number of attempts made before success will also increase. A
valuable experiment would thus be to try this procedure on a test for
which each item had eight or ten alternatives. This has not yet been
done.
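
The factor of 0.7 is consistent with a simple back-of-envelope argument, sketched here under the assumption (not stated in the text) that every attempted step contributes roughly the same amount of information, so that the standard error falls with the square root of the number of steps attempted. For k items and a mean of 2.0 attempts per item:

```latex
\frac{SE_{\mathrm{AUC}}}{SE_{\mathrm{conventional}}}
  \approx \sqrt{\frac{k}{2.0\,k}}
  = \frac{1}{\sqrt{2}}
  \approx 0.71
```

On the same assumption, halving the standard error would take a mean of about four attempts per item, which is why items with many more alternatives would be needed.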

Model (iii): Distractor Calibration 

This model is an extension of (ii) to allow for differences among 
the distractors. The item step difficulty parameter now describes the 
difficulty of the item at each step taking account of which 
distractors have already been eliminated. 

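Thus $\delta_i$ indicates the difficulty of item i at the initial
step when all distractors are present
$\delta_{i,A}$ indicates the difficulty of item i at the second
step when distractor A was chosen at the first
$\delta_{i,BC}$ indicates the difficulty of item i at the third step
after distractors B and C have been chosen (in
whatever order)

With this notation, the model becomes

$$\mathrm{Prob}\{X_{vi,S} = 1\} = \frac{e^{\alpha_v - \delta_{i,S}}}{1 + e^{\alpha_v - \delta_{i,S}}}$$

where S stands for the set of distractors already chosen.

The analysis and estimation procedures essentially follow those
for model (ii) except that the response data must be coded in a
different format. For a five-way item (for which the correct response
is E, and the distractors are labeled A-D), the structure of the
parameters to be estimated is:

Step 1:  $\delta_i$
Step 2:  $\delta_{i,A}$  $\delta_{i,B}$  $\delta_{i,C}$  $\delta_{i,D}$
Step 3:  $\delta_{i,AB}$  $\delta_{i,AC}$  $\delta_{i,AD}$  $\delta_{i,BC}$  $\delta_{i,BD}$  $\delta_{i,CD}$
Step 4:  $\delta_{i,ABC}$  $\delta_{i,ABD}$  $\delta_{i,ACD}$  $\delta_{i,BCD}$

Response data for an individual who chose responses A, C, E, in
that order, getting the item right at the third attempt, would be
coded

Step 1:  0
Step 2:  0 * * *
Step 3:  * 1 * * * *
Step 4:  * * * *
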

It should be noted that this coding scheme is severely 
constrained. There is at most one entry in each block, and a "1" 
entry effectively terminates the vector. Thus the range of possible 
response patterns is limited, and again the local independence 
principle is violated. 
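
The block coding can be sketched in Python as follows. The layout is an assumption made for this illustration: one block per step, with one slot for each set of distractors that could already have been chosen, listed in alphabetical order. The names `step_blocks` and `code_response` are hypothetical.

```python
from itertools import combinations


def step_blocks(distractors):
    """One block per step; each slot is a set of distractors already chosen."""
    d = sorted(distractors)
    return [[frozenset(c) for c in combinations(d, r)] for r in range(len(d))]


def code_response(sequence, correct, distractors):
    """Code one item's response sequence into blocks of 0, 1 and '*'.

    A wrong choice puts 0 in the slot for the current history, a correct
    choice puts 1 there and ends the coding. Every other slot stays '*'.
    """
    blocks = step_blocks(distractors)
    coded = [["*"] * len(block) for block in blocks]
    chosen = set()
    for step, choice in enumerate(sequence):
        if step >= len(blocks):
            break  # all scored steps used up; the last choice is forced
        slot = blocks[step].index(frozenset(chosen))
        if choice == correct:
            coded[step][slot] = 1
            break
        coded[step][slot] = 0
        chosen.add(choice)
    return coded


# Five-way item: correct answer E, distractors A to D, responses A, C, E.
for block in code_response(["A", "C", "E"], "E", "ABCD"):
    print(" ".join(str(x) for x in block))
```

Each block receives at most one entry and a 1 ends the vector, which is the constraint noted in the text.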

Estimation procedures can follow the sequence described in model
(ii), first to calibrate the item step values, and secondly to estimate
the person ability parameters. However, it is apparent that the
procedure is somewhat unwieldy. For each item the number of
difficulty parameters to be estimated is given by $(2^{m-1} - 1)$ where m
is the number of alternative responses in the item format. Inadequate
calibration of the parameters due to insufficient data can spoil the
overall measurement of person ability (viz. person measurement with
the Lord-Birnbaum three-parameter model and small data sets). A six
item five-way multiple choice test such as that described under model
(ii) would require the estimation of 90 item difficulty parameters
under model (iii) as opposed to 24 under model (ii). For this model,
in contrast to model (ii), it would seem wise to restrict item formats
to not more than three or four alternatives.
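
The parameter counts quoted above are easy to check. The following sketch, with hypothetical function names, counts the difficulty parameters per item under models (ii) and (iii) for a given number of alternatives m.

```python
from math import comb


def step_model_parameters(m: int) -> int:
    """Model (ii): one difficulty parameter per step, m - 1 steps per item."""
    return m - 1


def distractor_model_parameters(m: int) -> list[int]:
    """Model (iii): at step r + 1 there is one parameter for every set of r
    distractors already chosen out of the m - 1 available."""
    return [comb(m - 1, r) for r in range(m - 1)]


per_step = distractor_model_parameters(5)
print(per_step, sum(per_step))  # [1, 4, 6, 4] 15

# Six five-way items: model (iii) against model (ii).
print(6 * sum(per_step), 6 * step_model_parameters(5))  # 90 24

# Restricting items to three or four alternatives keeps the count small.
print(sum(distractor_model_parameters(3)), sum(distractor_model_parameters(4)))  # 3 7
```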

3. Trial Data Analysis

Calibration procedures for models (i) and (ii) have been
programmed in FORTRAN using variations of the PAIR algorithm described
above. Both programs have demonstrated their ability to recover the
parameter values used to generate artificial "fitting" data. Two data
sets from AUC tests, each comprising several hundred cases, have been
analyzed using these programs. One test is a junior high school
science test under development in England. The second is a college
level psychology test used in a private California university. The
results are still being studied.

Model (iii) requires the coding of which distractors were
selected in which sequence, and this is only practicable with a
clinically administered or computer administered test. For this
reason we have devoted considerable time to developing a software 
package that will administer AUC tests in schools, and store the 
results in a format suitable for aggregation and subsequent analysis. 
Details of this package are given in the Appendix. 






REFERENCES 

Andersen, E.B. Sufficient statistics and latent trait models.
Psychometrika, 1977, 42, 69-81.

Brown, J. Multiple response evaluation of discrimination. The
British Journal of Mathematical and Statistical Psychology, 1965,
18, 125-137.

Choppin, B.H. A fully conditional estimation procedure for Rasch
model parameters. Draft report to NIE, 1982.

Frary, R.B. The effect of misinformation, partial information, and
guessing on expected multiple-choice test item scores. Applied
Psychological Measurement, 1980, 4, 1, 79-90.

Hanna, G.S. Incremental reliability and validity of multiple-choice
tests with an answer-until-correct procedure. Journal of Educational
Measurement, 1975, 12, 3, 175-178.

Kane, M., & Moloney, J. The effect of guessing on item reliability
under answer-until-correct scoring. Applied Psychological
Measurement, 1978, 2, 1, 41-49.

Merwin, J.G. Rational and mathematical relationships of six scoring
procedures applicable to three-choice items. Journal of
Educational Psychology, 1959, 50, 4, 153-160.

Wilcox, R.R. Some new results on an answer-until-correct scoring
procedure. Journal of Educational Measurement, 1982, 19, 67-74.






APPENDIX 






INTERACTIVE COMPUTER PROGRAMS FOR 
CONFIDENCE-MARKING AND ANSWER-UNTIL-CORRECT TESTING 

Raymond Moy and Chih-Ping Chou
Center for the Study of Evaluation, UCLA 

Introduction 

In traditional scoring of multiple-choice tests, an item score of 
one is given if the examinee selects the correct answer, and zero if 
any other alternative is chosen. The problems with such a score 
assignment procedure are twofold. On the one hand, because of the 
limited number of distractors available, it is possible for an 
examinee to obtain a score of one simply through random selection of 
an alternative and without any knowledge of the correct answer. On 
the other hand, many students with partial knowledge will receive a 
score of zero even though they are able to reduce the number of answer 
alternatives to a smaller subset than those originally presented. 
Assuming that the correct answer is included in this subset, such 
students do not deserve one point full credit if they guess the 
correct answer, nor do they deserve a score of zero if they miss it. 
A more accurate score, reflecting their state of partial knowledge,
lies somewhere in between. The net result of the zero-one method of
scoring is a reduced efficiency of measurement, because reliability is 
decreased from having assigned ones to students who do not really know 
the answers, and zeros to those who have partial knowledge. 






It may, therefore, be possible to improve upon traditional
zero-one scoring if some method could be devised to obtain more
detailed information about the examinees' state of partial knowledge.
Although it might be possible to have an examinee give rationales for
choosing a particular answer alternative, this is not practical in
large scale testing efforts, nor will it be easy to assign objective
partial score credit to such open-ended responses. Instead, various
objective techniques have been suggested which may yield useful
information. Among these techniques are elimination scoring,
confidence marking, and answer-until-correct. All of the techniques
are based on examinee interactions with the item distractors, or on
obtaining information about how examinees view the correctness of
their answer choices.

In elimination scoring, the examinee is asked to indicate those
alternatives which he or she thinks are definitely incorrect. A score
of one is assigned if, and only if, all distractors are correctly
eliminated, and partial scores may be assigned on a weighted basis for
correctly eliminating some of the distractors. Various methods for
assigning partial credit have been proposed (e.g., Arnold & Arnold,
1970; Coombs, 1953; or Cross & Thayer, 1979); however, all methods are
rather arbitrary since none is based on explicit descriptions of the
relationship between choice of distractors and the ability of
interest. The methods differ, though, in how they deal with the
possibility of guessing behavior and misinformation (i.e., eliminating
the correct answer as wrong).






Aside from the problems of deciding partial credit scores for
various types of elimination patterns, there is also a significant
problem in getting examinees to respond properly to the task. There
is a tendency among examinees to be much too conservative when faced
with expressing their confidence in their answers (e.g., Ebel, 1968;
Hritz & Jacobs, 1970). If this is the case, then the ability
estimates from this procedure may be negatively biased.

Confidence marking procedures require the examinee to either 
select a correct answer and provide a confidence judgment in the 
answer (as exemplified in studies by Shaughnessy, 1979; Sieber, 1979) 
or to assign probabilities of correctness for each answer alternative 
(Koriat, Lichtenstein, & Fischhoff, 1980; Rippey & Donato, 1978). 
These confidence markings can then be used to score examinees' partial 
knowledge of individual items. 

As with elimination scoring, the validity of confidence marking
procedures depends on the examinees' responding properly and
accurately to the task. Personality characteristics which lead to
expressions of over- or under-confidence would be problematic, as
would variation across examinees in the interpretation of specific
confidence ratings.

The answer-until-correct technique avoids requiring examinees to
make judgments for individual answer alternatives and instead allows 
the examinees to select what they consider to be the correct answer 
and to continue choosing among the distractors until the correct 
answer is selected. The number of attempts an examinee takes before 
reaching the correct answer is taken to be indicative of the extent of 
the examinee's partial knowledge. 




In contrast to the confidence marking and elimination procedures,
AUC testing requires some method of providing feedback to the
examinees that tells them whether their answers are correct or not.
This means that either special answer sheets or individualized testing
sessions would be required.

Aside from these logistic problems, there is also difficulty in
interpreting the relationship between number of attempts and the
ability of interest. Although it is commonly agreed that the fewer
the attempts, the greater the partial credit that should be
awarded, an overall scoring algorithm which will maximize scaling
validity has not yet been devised. This is because information
regarding the relative difficulty of distractors needs to be
specified and, as yet, item writing technology is not refined
enough to accomplish the task.

In practical applications of AUC testing, it has been the
practice to simply use the number of attempts as the basis for
scoring. Whereas Gilman and Ferry (1972) found that split-half
reliability for this method of scoring was substantially increased
over zero-one scoring, Hanna (1975) and Taylor, West, and Tinning
(1975) found little or no improvement. One possible resolution of
these conflicting findings is that improvements through the use of AUC
scoring are dependent on the properties of the items and their
distractors. Kane and Moloney (1978) have shown that when all but two
distractors are eliminated as incorrect by all examinees, and when
random guessing takes place among the n-1 alternatives, zero-one
scoring is more efficient.
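
As an illustration of scoring by number of attempts, the following Python sketch contrasts zero-one scoring with one possible attempts-based rule. The linear rule (full credit on the first attempt, no credit when every distractor is tried first) and the function names are assumptions made for this example; the report does not prescribe a particular weighting.

```python
def attempts_score(n_attempts, n_alternatives):
    """Linear partial credit from the number of attempts (an assumed rule).

    Returns 1.0 for a correct first choice and 0.0 when the correct answer
    is reached only after every distractor has been tried.
    """
    if not 1 <= n_attempts <= n_alternatives:
        raise ValueError("attempts must lie between 1 and the number of alternatives")
    return (n_alternatives - n_attempts) / (n_alternatives - 1)


def zero_one_score(n_attempts):
    """Traditional scoring: credit only for a correct first choice."""
    return 1 if n_attempts == 1 else 0


# Hypothetical attempt counts for one examinee on five 5-alternative items.
attempts = [1, 2, 1, 5, 3]
auc_total = sum(attempts_score(a, 5) for a in attempts)
zero_one_total = sum(zero_one_score(a) for a in attempts)
print(auc_total, zero_one_total)
```

Under this rule two examinees with the same zero-one total can receive different AUC totals, which is the extra information the attempts-based approach tries to use.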




In contrast to the approach of assigning partial credit on the
basis of the number of attempts, Wilcox (1981, 1982) proposes using
AUC information to yield correction-for-guessing estimates. Under
this conceptualization, the ability of interest is the proportion of
items an examinee is able to answer correctly, with no credit for
partial knowledge. However, partial knowledge will affect the
probability of getting an answer correct through guessing, and it is
this probability which is estimated from AUC information.

Whether one chooses this latter conceptualization of ability or
the partial credit conceptualization is a question of the meaning of
one's scale and is not a matter of one being correct and the other
incorrect. Quite simply, they are two different ways in which AUC
information can be utilized to improve on zero-one scoring.

In order to investigate the value of AUC information more fully,
substantial amounts of data must be gathered and the logistic problems
of providing AUC feedback to the examinees must be solved. Toward
this end, an interactive program was developed to follow an AUC
format. The program was designed to allow AUC testing on a number of
different tests with students of a wide range of ability levels. The
rest of this report describes in greater detail the overall design
of the program, the options available in a typical program run, the
mechanics of inputting new tests, and the production of output for
data analysis.






The AUC Program



Three programs have been developed for gathering AUC test data:
(1) the AUCMAIN program for administering the tests, (2) a test FILE
WRITER program for creating new tests as input to AUCMAIN, and (3) a
CONCATENATION program for creating a single data file containing
responses from all students being administered a particular test.
Figure 1 shows how these three programs are related to each other.






Figure 1

Interrelationship of AUCMAIN (I), FILE WRITER (II),
and CONCATENATION (III) programs.

II. FILE WRITER: creates the individual test files used as input to
AUCMAIN.

I. AUCMAIN: in the teacher session a test is selected; in the student
session the selected test is administered to students. After each
student session, control is returned to the teacher. The output is a
set of student response files.

III. CONCATENATION program: takes the student response files as input
and outputs a single data file containing the responses of all
students.




The AUCMAIN program. The AUCMAIN program contains two sections.
The first section is designed to interact with the teacher, who is
given a description of administration procedures and is required to
specify session parameters which will identify and control
the administration of tests to students. The second section is the
actual test session controlled by the examinee.

Teacher session. The AUCMAIN program is self-booting once the
AUC disk is mounted and the computer turned on. The screen will show:



COPYRIGHT 1982
REGENTS OF UNIVERSITY OF CALIFORNIA
ANSWER UNTIL CORRECT
CONFIDENCE MARKING
TESTS

... HIT (RETURN)



Teachers should then hit the <RETURN> key to view the next screen:

PROGRAM WRITTEN BY
RAYMOND MOY AND CHIH-PING CHOU
WITH ASSISTANCE FROM
GINETTE DELANDSHERE
UNDER THE DIRECTION OF
DR. BRUCE CHOPPIN

CENTER FOR THE STUDY OF EVALUATION
UNIVERSITY OF CALIFORNIA, L.A.

... HIT (RETURN)






After the <RETURN> key is hit again, the program will ask whether
teachers would like to have a description of the AUC testing
technique:

IF YOU WISH TO SKIP THE DESCRIPTIVE
INFORMATION ABOUT AUC TESTING,
TYPE IN THE WORD 'SKIP'
AND HIT THE (RETURN) KEY
OTHERWISE, JUST HIT THE
(RETURN) KEY ALONE

If the word 'skip' is entered, the program will proceed to the test
selection screen. Otherwise, AUC descriptive information is presented
on the following screens. Teachers hit the <RETURN> key to proceed
from one screen to the next.



THIS TEST PROGRAM WAS DESIGNED TO
OBTAIN MORE INFORMATION FROM
STUDENTS' MULTIPLE-CHOICE RESPONSE
THAN IS AVAILABLE FROM TRADITIONAL
RIGHT/WRONG SCORINGS

... HIT (RETURN)






ALL THIS TEST INFORMATION WILL BE
STORED & LATER ANALYZED FOR RELI-
ABILITY AND VALIDITY.
BEFORE THE FIRST STUDENT BEGINS,
WE NEED YOU TO PROVIDE SOME INFORM-
ATION.
FOR EACH QUESTION, TYPE IN YOUR
RESPONSE AND THEN HIT THE (RETURN)
KEY.

... HIT (RETURN)



THE STUDENT IS PRESENTED WITH A
SERIES OF TEST ITEMS WHICH HE OR SHE
RESPONDS TO UNTIL THE
CORRECT ANSWER IS CHOSEN.
ALSO, STUDENTS ARE ASKED TO RATE
THEIR LEVEL OF CONFIDENCE IN THEIR
ANSWERS.

... HIT (RETURN)






Following the AUC descriptive information, the test selection
screen is provided. Teachers are asked to select one of two test 
sets: Language Arts or Science/Math. 




A) LANGUAGE ARTS
B) SCIENCE/MATH

WHICH SET WOULD YOU LIKE
ADMINISTERED?



The first set, Language Arts, consists of six tests:




The second set, Science/Math, contains four tests:



WE HAVE THE FOLLOWING TESTS 
AVAILABLE: 

(1) SCIENCE 

(2) ARITHMETIC 

(3) MATH II 

(4) MATH 

WHICH TEST DO YOU WANT ADMINIS 
TERED? 

ENTER TEST NUMBER: 



... HIT (RETURN)




After a test is selected, teachers are requested to provide
information which will be used to help identify student response
files.






First, a teacher's last name is requested: 




then the school name: 




The AUC program will then confirm all the information input as
follows:






ACCORDING TO THE INFORMATION
YOU HAVE ENTERED:
YOUR NAME IS ' . . name . . '
AND THE NAME OF YOUR SCHOOL IS
' . . school name . . '
THE TEST YOU HAVE CHOSEN IS
test name.
IS ALL OF THIS INFORMATION CORRECT?
TYPE (Y) FOR 'YES' OR (N) FOR 'NO'

... HIT (RETURN)



Teachers can type <Y> to confirm the information and proceed with the 
student session. If corrections are required, teachers should type 
<N> and hit <RETURN>. The screen will then print out the following 
question: 



ENTER WHICH TYPE OF INFORMATION
YOU WISH TO CHANGE. (ENTER 'TEST',
'NAME', OR 'SCHOOL' AND HIT RETURN
KEY)

... HIT (RETURN)




For example, if the teacher wishes to change the test selection,
he or she should type 'test' and then hit <RETURN>. The program will
go back to the test selection screen, and then present the
information again for confirmation. The program proceeds to the
student session once all the information has been confirmed as correct.






The student test session begins after the following message:

THE COMPUTER IS READY FOR THE
FIRST STUDENT.
BEFORE THE STUDENT ANSWERS
THE TEST ITEMS, SOME PRELIMINARY
QUESTIONS WILL BE PRESENTED
TO LET HIM OR HER SEE HOW
THE COMPUTER WORKS.
PLEASE HIT THE (RETURN) KEY TO
BEGIN THE STUDENT SESSION

... HIT (RETURN)



The computer will then load the selected test and ask the teacher to
stand by while this is being completed. As each question is read into
the computer's memory, a beep will be heard:



PLEASE WAIT WHILE THE 
testname TEST IS 
BEING SELECTED & READ 
INTO THE COMPUTER.



THE TEST ITEMS ARE STILL BEING 
READ IN. 



THE STUDENT CAN BEGIN IN A FEW 
SECONDS AFTER THE BEEPING STOPS. 




Student session. After the test has been read in, students are
requested to provide information which will be used for identification
purposes. During this time, students also have an opportunity to
get acquainted with the computer and learn how to interact with it.
Students are asked for their names, birthdates, and grades:






HELLO! WELCOME TO OUR COMPUTER
QUIZ.

PLEASE TYPE YOUR FULL NAME AND
THEN HIT THE (RETURN) KEY.

PLEASE TYPE YOUR GRADE.
FOR EXAMPLE, '6', '9', OR '12'.
(IF YOU ARE A TEACHER, TYPE 'T').

... HIT (RETURN)






If <T> is typed, no response file is created at the end of the 
session. 

Students will then get a short description of the test that is 
going to be administered. Using the ESL I test as an example, the 
student will see the following screen: 




IN THIS QUIZ, YOU WILL BE ASKED 10
ESL I QUESTIONS.
AFTER EACH QUESTION WILL BE 5
LETTERS, EACH WITH AN ANSWER
FOLLOWING IT.

YOU MUST READ ALL THE ANSWERS,
AND TYPE IN THE LETTER OF THE BEST
ONE. IF YOU ARE READY, HIT THE
(RETURN) KEY.




When the student is ready and hits the <RETURN> key, the 
directions for the ESL I test are presented: 



DIRECTIONS:
READ EACH QUESTION AND SELECT THE
ANSWER WHICH WOULD GO IN THE BLANK
(   ) AND BEST COMPLETE THE
MEANING OF THE SENTENCE.

... HIT (RETURN)



The first item of the test will then be presented:

Q.1 DID YOU TELL JOHN WHERE (   )
GONE?

(A) SHE
(B) HAD SHE
(C) SHE HAD
(D) HAS SHE
(E) (NO WORD IS NEEDED)

WHICH IS THE CORRECT ANSWER?
A, B, C, D, OR E?



In this program, students have as many chances as they need to 
answer an item correctly. Each time an answer is provided, the screen 
will present the answer just made, and allow students to make changes 
if desired. For instance, if answer <A> is chosen, the program will 
print out the following statements on the screen: 





YOU HAVE CHOSEN ANSWER (A).
ARE YOU HAPPY WITH THIS ANSWER? 
IF SO, TYPE (Y) FOR 'YES', 
OR ELSE TYPE (N) FOR 'NO', 
AND YOU CAN CHOOSE ANOTHER ANSWER. 






If students type <N> at this point, the question is presented again 
along with the available choices. For the first attempt of each item, 
students are asked about the level of confidence in their answer: 




For subsequent attempts, the confidence-marking part is skipped. 

After each response, students will receive feedback on whether
they are correct or not. If the answer is correct, the next item will
be presented. On the other hand, if the answer is wrong, students
stay on the item. For each additional attempt, the answers previously
selected are eliminated from the alternatives remaining for that item.
The answer-until-correct procedure can be illustrated by the following
flow chart:



(1) New item: set i = 0.
(2) Present the test item with (t-i) distractors.
(3) Set i = i + 1; the student makes his or her ith trial.
(4) If the answer provided on the ith trial is correct, proceed to a
    new item (step 1). If the answer provided on the ith trial is
    wrong, return to step 2.

i: number of trials attempted by student
t: number of total distractors in an item






Using our example test item, this would proceed as follows. First a 
new item is presented: 




Q.1 DID YOU TELL JOHN WHERE (   )
GONE?

(A) SHE
(B) HAD SHE
(C) SHE HAD
(D) HAS SHE
(E) (NO WORD IS NEEDED)

WHICH IS THE CORRECT ANSWER:
A, B, C, D, OR E?




If answer <A>, which is incorrect, is selected, the student will see
the same item without distractor <A> after a short pause:



Q.1 DID YOU TELL JOHN WHERE (   )
GONE?

(B) HAD SHE
(C) SHE HAD
(D) HAS SHE
(E) (NO WORD IS NEEDED)

WHICH IS THE CORRECT ANSWER:
A, B, C, D, OR E?




For the subsequent trials, distractors will be excluded from the
available choices after they are selected.
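
The elimination loop described above can be sketched in Python as follows. This is only an illustration, not the AUCMAIN code: the function name, the `choose` callback that stands in for the examinee at the keyboard, and the decision to record the correct answer automatically once every distractor has been tried are all assumptions. The response time limit is left out.

```python
def administer_item(alternatives, correct, choose):
    """Run one answer-until-correct item and return the selection order.

    alternatives: labels shown to the examinee, e.g. ["A", "B", "C", "D", "E"]
    correct:      label of the correct answer
    choose:       callable given the remaining labels, returning one of them
                  (a stand-in for the examinee's keyboard input)
    """
    if correct not in alternatives:
        raise ValueError("the correct answer must be one of the alternatives")
    remaining = list(alternatives)
    selected = []
    # Keep asking while at least one distractor is still on the screen.
    while len(remaining) > 1:
        answer = choose(remaining)
        if answer not in remaining:
            raise ValueError(f"{answer!r} is not an available choice")
        selected.append(answer)
        if answer == correct:
            return selected
        remaining.remove(answer)  # a wrong choice is eliminated from the display
    # Assumption: once all distractors are used up, the correct answer is
    # recorded as the final entry without asking again.
    selected.append(correct)
    return selected


# Scripted examinee for the example item: tries A, then B, then C (correct).
script = iter(["A", "B", "C"])
order = administer_item(list("ABCDE"), "C", lambda remaining: next(script))
print(order)
```

The length of the returned list is the number of attempts, which is the quantity used for attempts-based scoring.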






Students are allowed to proceed to the next item under the
following conditions: (1) the present item is answered correctly, (2)
all the incorrect answers have been chosen, or (3) the response time
exceeds the time limit of 120 seconds.

At the end of the test session, the screen will present a summary
of the test results. For example, one student's results might be
presented as follows:



These results will remain on the screen for about 45 seconds. After
this time has elapsed, the screen will show the following message
while the computer clears out old variable values from memory and
stores the student's responses on disk:







The <RESET> key should never be touched during this stage; otherwise
the data of the student who just finished the test will be ruined. As
soon as the data are saved, control of the program returns to the
teacher. The teacher then has three options: (1) to run another student
on the same test, (2) to select another test, or (3) to end the
program. These options are presented as follows:



IF THE NEXT STUDENT IS READY
TYPE 'RUN' AND HIT THE (RETURN) KEY

IF NEW TEACHER INFORMATION
NEEDS TO BE ENTERED,
OR A NEW TEST IS TO BE SELECTED,
TYPE 'NEW' AND HIT THE RETURN KEY

IF THE TEST SESSION IS OVER,
TYPE 'END'.



If 'RUN' is typed, the program will go back to the student
session. If 'NEW' is typed, the program will go to the very beginning
of the program, where the teacher is asked to supply new parameters for
a program run. The AUC program can be stopped by typing 'END'.

Another feature of the AUC program is the detection of whether 
there is enough space to store student data. If the disk is full, the 
following messages will be presented to the student: 







This message remains on the screen for 60 seconds and then the teacher 
will receive the following message: 



THE DISK IS FULL AND NO MORE TESTS
CAN BE CONDUCTED. PLEASE SEND THE
DISK TO THE CENTER FOR THE STUDY OF
EVALUATION AS SOON AS POSSIBLE.

THANK YOU.



In addition to the AUC response patterns and confidence level 
responses, the AUC program also keeps track of the time it takes a 
student to respond to each distractor. The maximum time recorded for 
each response is ninety-nine seconds. This should be adequate for
most examinees since the average response time on the first trial is 
less than fifty seconds. 

One last feature installed in the AUCMAIN program is that it 
allows teachers to interrupt a test when they feel the test being 
administered is inappropriate. If the teacher holds down the <CNTRL> 
button, and hits the <F> key at the same time right after an item is 
presented, the following question appears on the screen: 




If the teacher responds 'YES', the program skips to the last step
of the program, where the teacher is given the three options of running
a new student, selecting a new test, or ending the program. If the
teacher responds 'NO', the program proceeds with the last question
presented.



Output Files 

After a test is administered to a student, an output file is 
created and named with the following format: 

TESTNAME STUDENTNAME BIRTHDATE 

For example: 

MATH JOHN DOE 5/16/65 

MATH II SALLY BUCK 4/21/67 

In the first line of each output file are the student's name, the
teacher's name, the school, the student's birthdate, and grade level. A
period is used as a separator character inserted between each variable
(see Figure 2 for an example output file of the 10-item Math II
test). Following the first line are the student responses, one line
per question. Up to k responses, where k is the number of question
alternatives, are stored on a line in the same order as the student
selected them. The last response in each line is always the correct
answer. An exclamation mark ends each line. In the event that the
student does not respond at all to a question, only an
exclamation mark will appear on the data line.

Following the response choices for the n questions are the
response times, in seconds, that it takes the examinee to select a
particular alternative. Again, there is one line allocated per
question. There is a one-to-one correspondence between each line of
responses and each line of response times. Within a line, response






Figure 2

Example Examinee MATH II Output Produced
by a Single Run of AUCMAIN Program.

(Output File is Saved on Disk as MATH II YING LU 05/17/67.)

YING LU.CHU.UCLA.05/17/67.7
B!
A!
D!
B!
BC!
E!
D!
!
C!
C!
21.!
7.!
23.!
22.12.!
22.!
20.!
!
23.!
42.!
1!1!1!1!1!1!1!!3






times are separated by periods. Finally, the last line of the
output file contains the confidence ratings for the first response to
each question. Confidence ratings are obtained only for the student's
first choice on each question, so there is only one rating per
question. Ratings are separated by exclamation marks. It should be
noted that in the example output file in Figure 2, the examinee did
not respond to Question 8.
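
To make the layout concrete, here is a minimal Python sketch of a reader for one student output file. It assumes exactly the structure described above (a header line, n response lines, n time lines, and one ratings line) and that blank lines and stray spaces can be ignored. The function name, the dictionary keys, and the sample data are hypothetical and are not part of the AUC programs.

```python
def parse_output_file(lines, n_items):
    """Parse one student output file into a dictionary (illustrative only)."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) != 2 * n_items + 2:
        raise ValueError("expected a header, n response lines, n time lines and a ratings line")

    # Header: fields separated by periods (so names may not contain periods).
    fields = [field.strip() for field in rows[0].split(".")]
    name, teacher, school, birthdate, grade = fields[:5]

    # Responses: letters in the order chosen, each line ended by "!".
    responses = [list(row.rstrip("!").replace(" ", ""))
                 for row in rows[1:1 + n_items]]

    # Times: seconds separated by periods, each line ended by "!".
    times = [[int(t) for t in row.rstrip("!").split(".") if t.strip()]
             for row in rows[1 + n_items:1 + 2 * n_items]]

    # Ratings: one per question, separated by "!"; empty means no response.
    ratings = rows[-1].split("!")[:n_items]
    ratings += [""] * (n_items - len(ratings))
    confidence = [int(r) if r.strip() else None for r in ratings]

    return {"name": name, "teacher": teacher, "school": school,
            "birthdate": birthdate, "grade": grade,
            "responses": responses, "times": times, "confidence": confidence}


# Hypothetical three-item file; the third question was not answered.
sample = ["PAT DOE.SMITH.LINCOLN.5/16/65.7",
          "B!", "BC!", "!",
          "21.!", "22.12.!", "!",
          "1!2!"]
record = parse_output_file(sample, 3)
print(record["responses"], record["times"], record["confidence"])
```

The number of items has to be supplied by the caller because the output file itself does not state it.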

At the same time the output file is saved, the file name 
(including the test name, student name, and birthdate), is appended to 
a master file which includes the names of all examinees taking the 
same test on the same disk (an example file is shown in Figure 3). 
There is a master file for each available test, named

AUC(test name)

For example:

AUCMATH
AUCMATH II

The master files are subsequently used to concatenate all responses
for all examinees on all disks into a single data file for the
purposes of overall analysis of test responses. The program which
has been developed to do this is called AUCFILE and is described
below.






Figure 3

Contents of AUCMATH II Master File of All Examinees
Taking MATH II Test on a Single Disk.

MATH II DELWIN CHIN APRIL 18
MATH II SEAN MOORE 7/20/66
MATH II FRANK DAMIANI 8/8/66
MATH II AARON SEELER 11/25/66
MATH II PEDRAM MADDAHIAN 2 2 79
MATH II ANNE HOLMES 9/2/66
MATH II YING LU 05/17/67
MATH II SHARON SMASON 6/9/67






Concatenation of Files with the AUCFILE Program 

A program entitled, AUCFILE, has been created to concatenate the 
student files on a test into a single data file. After AUCFILE has 
been loaded into the computer, the disk or disks containing the 
student files are inserted into either Disk Drive I or II. When the 
AUCFILE program is run, the user is queried about which tests need to 
be concatenated. The program then uses the master files (created by 
the main program and updated with each test run) on the disks to 
control the reading and concatenating of student responses. The 
concatenated file is saved as (test name)DATA. For example:

MATHDATA
MATH IIDATA

These files are always written to the disk in Drive I. 
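
The file naming conventions and the role of the master file can be summarized in a short Python sketch. The function names and the `read_file` callback are hypothetical; the sketch only mirrors the naming patterns given in the text and does not reproduce the AUCFILE program or its disk handling.

```python
def output_file_name(test, student, birthdate):
    """Name of one student's output file: TESTNAME STUDENTNAME BIRTHDATE."""
    return f"{test} {student} {birthdate}"


def master_file_name(test):
    """Name of the master file listing all output files for a test."""
    return f"AUC{test}"


def data_file_name(test):
    """Name of the concatenated data file for a test."""
    return f"{test}DATA"


def concatenate(master_lines, read_file):
    """Collect the contents of every student file listed in a master file.

    read_file is a caller-supplied function (an assumption of this sketch)
    that returns the contents of the named file.
    """
    combined = []
    for entry in master_lines:
        entry = entry.strip()
        if entry:  # skip blank lines in the master file
            combined.append(read_file(entry))
    return combined


# Hypothetical in-memory "disk" holding two student files.
disk = {
    "MATH JOHN DOE 5/16/65": "contents of first file",
    "MATH SALLY BUCK 4/21/67": "contents of second file",
}
master = ["MATH JOHN DOE 5/16/65", "MATH SALLY BUCK 4/21/67"]
combined = concatenate(master, disk.__getitem__)
print(master_file_name("MATH II"), data_file_name("MATH II"), len(combined))
```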

In the concatenated file, responses, confidence ratings, and response
times follow a fixed format. Each student has
three records. The first record contains information about the
student (name, teacher, school, birthdate, and grade). The second
record contains the item responses and the confidence ratings. A
column is allocated for each item alternative. Once a correct answer
is selected, blanks are inserted for the remaining distractors.
Following an item's responses is the confidence rating for the item.
The third record contains the response times for each alternative.
Two columns are allocated for each alternative, so the maximum
possible time is 99 seconds. As with the second record, blanks are
inserted for the remaining alternative choices after the correct
answer is selected. An example concatenated file appears in Figure 4.
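
The fixed-format second and third records can be illustrated with the following Python sketch. The column widths follow the description above (one column per alternative plus one for the confidence rating; two columns per alternative for times). The function names, the use of a blank for a missing rating, and the zero-padding of times are assumptions made for the example.

```python
def response_record(responses, confidences, n_alternatives):
    """Build the second record: per item, k response columns + 1 rating column."""
    parts = []
    for chosen, rating in zip(responses, confidences):
        # Responses in the order selected, blank-padded to k columns.
        parts.append("".join(chosen).ljust(n_alternatives))
        # One column for the confidence rating (blank if none was given).
        parts.append(" " if rating is None else str(rating)[:1])
    return "".join(parts)


def time_record(times, n_alternatives):
    """Build the third record: two columns per alternative, times capped at 99."""
    parts = []
    for item_times in times:
        cells = "".join(f"{min(t, 99):02d}" for t in item_times)
        parts.append(cells.ljust(2 * n_alternatives))
    return "".join(parts)


# Hypothetical three-item, five-alternative example; item 3 was not answered.
rec2 = response_record([["B"], ["B", "C"], []], [1, 2, None], 5)
rec3 = time_record([[21], [22, 12], []], 5)
print(repr(rec2))
print(repr(rec3))
```

Because every item occupies the same number of columns, an analysis program can read item j's data from a fixed offset without scanning for separators.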




Figure 4

Contents of MATH IIDATA: Concatenated and Formatted Responses of
Examinees Taking MATH II Test from Several Different Disks

[Listing of three fixed-format records (student information, item
responses with confidence ratings, response times) for each of eleven
examinees; the column alignment of the listing is not recoverable.]




Creating New Files for Use as Input Tests to AUCMAIN Program 

A program named FILEWRITER has been written to create input files
for the AUCMAIN program. If any new tests are to be input into the
program, the following format must be followed:



Line(s)   Contents

1         Title of test
2         Number of items
3         Number of choices per item
4-9       Directions for taking the test, up to six lines long. Dummy
          characters must be typed in lines not occupied by directions.
10-       - Start in line 10 the stem of question 1: Q.1 (item stem).
            Continue on the next line as needed. Each line should not
            exceed 34 spaces in length.
          - Response alternatives must begin with an open
            parenthesis, (: (A) (distractor). Continue on the next
            lines as needed. Each line should not exceed 34 spaces.
          - The correct answer must follow the last distractor of each
            question; it must be starred: *B
          - After the correct answer, start the next question on the
            next line (Q.2).
          - Repeat until all questions are typed in.
          - End the entire file with a '!'.

The total possible lines for each question is 23 lines; within 
this limit, a stem or distractor can be up to 10 lines long. An 
example test following this format is presented in Figure 5. 
Unfortunately, one limitation of the FILEWRITER program is that commas 
may not be used anywhere in the file. 
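
The format rules above lend themselves to a simple automated check. The Python sketch below assembles the lines of a test file and rejects input that breaks the stated limits. It is not the FILEWRITER program: the function name and argument layout are hypothetical, the use of '*' as the dummy character is inferred from Figure 5, and applying the 34-space limit to every line and counting the starred answer line toward the 23-line limit are assumptions.

```python
MAX_WIDTH = 34               # maximum length of a line, in spaces
MAX_LINES_PER_QUESTION = 23  # stem, alternatives and answer line together (assumed)
N_DIRECTION_LINES = 6        # lines 4-9 of the file


def build_test_file(title, n_choices, directions, items):
    """Return the lines of a test file.

    items: list of (stem_lines, alternatives, correct_letter), where
    alternatives is a list of (letter, text) pairs.
    """
    if len(directions) > N_DIRECTION_LINES:
        raise ValueError("directions may be at most six lines long")
    lines = [title, str(len(items)), str(n_choices)]
    # Pad unused direction lines with a dummy character.
    lines += list(directions) + ["*"] * (N_DIRECTION_LINES - len(directions))
    for number, (stem, alternatives, correct) in enumerate(items, start=1):
        block = [f"Q.{number} {stem[0]}"] + list(stem[1:])
        block += [f"({letter}) {text}" for letter, text in alternatives]
        block.append(f"*{correct}")  # starred correct answer ends the question
        if len(block) > MAX_LINES_PER_QUESTION:
            raise ValueError(f"question {number} is longer than 23 lines")
        lines += block
    lines.append("!")  # end-of-file marker
    for line in lines:
        if "," in line:
            raise ValueError(f"commas are not allowed: {line!r}")
        if len(line) > MAX_WIDTH:
            raise ValueError(f"line exceeds 34 spaces: {line!r}")
    return lines


# One-item demonstration using the first question of Figure 5.
demo = build_test_file(
    "DEMO", 5,
    ["CHOOSE THE BEST POSSIBLE ANSWER"],
    [(["ONE SET OF FACTORS FOR 56 IS"],
      [("A", "2*3*7"), ("B", "8*7"), ("C", "2*26"), ("D", "4*13"), ("E", "9*6")],
      "B")],
)
print("\n".join(demo))
```

With this layout the first question stem always lands on line 10 of the file, as the format requires.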






Figure 5

Contents of MATH II: Test File Input for AUCMAIN Program

MATH II
10
5
CHOOSE THE BEST POSSIBLE ANSWER
FOR THE FOLLOWING MATH QUESTIONS.
YOU DO NOT NEED ANY MATERIAL OR
CALCULATOR TO FIND THE CORRECT
ANSWER.
*
Q.1 ONE SET OF FACTORS FOR 56 IS
(A) 2*3*7
(B) 8*7
(C) 2*26
(D) 4*13
(E) 9*6
*B
Q.2 WHICH NUMBER IS THE MISSING
FACTOR?
2*2* *8 = 64
(A) 2
(B) 3
(C) 5
(D) 8
(E) 12
*A
Q.3 WHICH ONE OF THESE EQUATIONS
IS TRUE?
(A) (8*5) = (8+5)
(B) (8+2)/4 = (4+2)/8
(C) (6-2)*5 = (2*5)-6
(D) (2+6)*5 = (5*8)
(E) (5*6)+2 = (5*6)-2
*D
Q.4 WHAT IS THE MISSING NUMBER
IN THE SEQUENCE?
35; 31; 27; ;19;
(A) 24
(B) 23
(C) 15
(D) 14
(E) 11
*B
Q.5 WHAT IS THE NEXT NUMBER IN THE
SEQUENCE?
3;3;4;5;5;6;7;7;8;9;
(A) 11
(B) 10
(C) 9
(D) 8
(E) 7
*C
Q.6 ANOTHER WAY TO REPRESENT
647 IS...
(A) 6 + 4 + 7
(B) (6+4+7)*100
(C) (6*10)+(4*10)+(6*10)
(D) (6*10)+47
(E) (6*100)+(4*10)+(7*1)
*E
Q.7 WHICH OF THE FOLLOWING PERIOD
OF TIME IS CLOSEST TO AN HOUR?
(A) 23 MINUTES 50 SEC.
(B) 36 MINUTES 58 SEC.
(C) 43 MINUTES 10 SEC.
(D) 71 MINUTES 12 SEC.
(E) 99 MINUTES 2 SEC.
*D
Q.8 MR. JONES LEAVES HIS HOUSE
EVERY MORNING AT 6.30 A.M. TO
GO TO WORK. HE HAS TO DRIVE
72 MILES AND HIS CAR AVERAGES
48 MILES AN HOUR. AT WHAT TIME
DOES HE ARRIVE AT WORK?
(A) 7.00 A.M.
(B) 7.30 A.M.
(C) 8.00 A.M.
(D) 8.30 A.M.
(E) 9.00 A.M.
*C
Q.9 YOU HAVE TO BUY LEMONADE FOR A
PARTY. EACH BOTTLE COSTS 75 CENTS.
HOW MANY BOTTLES WILL YOU BE ABLE
TO BUY IF YOU HAVE 10 DOLLARS TO
SPEND?
(A) 10
(B) 12
(C) 13
(D) 14
(E) 15
*C
Q.10 LAST MONTH JIM WORKED 3 HOURS
A DAY FOR 20 DAYS. HE WAS PAID 4
DOLLARS AN HOUR. HE ALSO BOUGHT
2 RECORDS FOR 8 DOLLARS EACH. HOW
MUCH MONEY DOES HE HAVE LEFT?
(A) 240 DOLLARS
(B) 232 DOLLARS
(C) 224 DOLLARS
(D) 80 DOLLARS
(E) 64 DOLLARS







REFERENCES

Arnold, J.C. & Arnold, P.L. On scoring multiple-choice exams allowing
for partial knowledge. The Journal of Experimental Education,
1970, 39, 8-13.

Coombs, C.H. On the use of objective examinations. Educational and
Psychological Measurement, 1953, 13, 308-310.

Cross, L.H. & Thayer, N.F. A new method for administering and scoring
multiple-choice tests: Theoretical and empirical
considerations. Unpublished manuscript, Virginia Polytechnic
Institute and State University, 1979.

Ebel, R.L. Blind guessing on objective achievement tests. Journal of
Educational Measurement, 1968, 5, 321-325.

Gilman, D. & Ferry, P. Increasing test reliability through
self-scoring procedures. Journal of Educational
Measurement, 1972, 9, 205-207.

Hanna, G. Incremental reliability and validity of multiple-choice
tests with an answer until correct procedure. Journal of
Educational Measurement, 1975, 12, 175-178.

Hritz, R.J. & Jacobs, S.S. Risk-taking and the assessment of partial
knowledge. Paper presented at the Annual Meeting of the American
Psychological Association, Miami Beach, Florida, September 1970.

Kane, M. & Moloney, J. The effect of guessing on item reliability
under answer until correct scoring. Applied Psychological
Measurement, 1978, 2, 41-49.

Koriat, A., Lichtenstein, S., & Fischhoff, B. Reasons for
confidence. Journal of Experimental Psychology: Human Learning
and Memory, 1980, 6, 107-118.

Rippey, R. & Donato, J. Interactive confidence test scoring and
interpretation. Educational and Psychological Measurement, 1978,
38, 153-157.

Shaughnessy, J. Confidence-judgment accuracy as a predictor of test
performance. Journal of Research in Personality, 1979, 13,
504-514.

Sieber, J. Confidence estimates on the correctness of constructed and
multiple-choice responses. Contemporary Educational Psychology,
1979, 4, 272-287.

Taylor, J., West, D., & Tinning, F. An examination of decision-making
based on a partial credit scoring system. Paper presented at the
Annual Meeting of the National Council on Measurement in
Education, Washington, D.C., 1975.

Wilcox, R.R. Solving measurement problems with an
answer-until-correct scoring procedure. Applied Psychological
Measurement, 1981, 5, 399-414.

Wilcox, R.R. Some new results on an answer-until-correct scoring
procedure. Journal of Educational Measurement, 1982, 19, 67-74.






